08_PCA

Authors

Edene Levine (s240875), Molly Hong-Minh (s242955), Pablo Fernandez (s243357), Vinit Nilesh Vasa (s242582)

PCA of Antibiotic Resistance Patterns Across Different Species

To explore shared patterns of antibiotic resistance across bacterial species, we performed a Principal Component Analysis (PCA) on the numeric resistance profiles. This approach reveals the main axes of variation in multi-drug resistance and allows us to visualize whether species cluster according to their resistance signatures.

Load Libraries

library("tidyverse")
library("here")
library("broom")

Load Data

MDR_df <- read.csv(here("data/03_dat_aug_wide.csv"))

Convert Resistance Categories to Numeric

The categorical resistance values R, I, and S (resistant, intermediate, susceptible) were already converted to the numeric scores 1, 0.5, and 0 in 03_augment.qmd, so the data loaded here can be analyzed quantitatively without further recoding.
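The recoding itself lives in 03_augment.qmd and is not shown here. As a minimal sketch of the mapping only, assuming a hypothetical toy table with made-up antibiotic columns `AMX` and `CIP`, it could look like this:

```r
library("dplyr")

# Hypothetical toy data; AMX and CIP are illustrative column names only
toy_df <- tibble(
  Species = c("Escherichia coli", "Proteus mirabilis", "Serratia marcescens"),
  AMX = c("R", "I", "S"),
  CIP = c("S", "R", "I")
)

# Lookup table for the scores described above
resistance_scores <- c(R = 1, I = 0.5, S = 0)

# Replace each category with its score in every antibiotic column
toy_numeric <- toy_df |>
  mutate(across(c(AMX, CIP), \(x) unname(resistance_scores[x])))

toy_numeric
```

A named-vector lookup keeps the mapping in one place, and any value other than R, I, or S becomes `NA` rather than being silently scored.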

Run Initial PCA

We perform a PCA using only the antibiotic-resistance columns, scaling the variables to unit variance to ensure equal contribution across antibiotics.

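# Antibiotic columns span column 8 through the second-to-last column
antibiotics_cols <- colnames(MDR_df)[8:(ncol(MDR_df) - 1)]

# prcomp() centers by default; scale. = TRUE also scales to unit variance
pca_fit <- MDR_df |>
  select(all_of(antibiotics_cols)) |>
  prcomp(scale. = TRUE)

# Attach the PC scores to the original rows and plot the first two components
pca_all_plot <- pca_fit |>
  augment(MDR_df) |>
  ggplot(mapping = aes(x = .fittedPC1,
                       y = .fittedPC2,
                       color = Species)) +
  geom_point(alpha = 0.2) +
  labs(x = "PC1",
       y = "PC2")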

pca_all_plot

As seen in the initial PCA plot, the structure of the data is dominated by the large number of E. coli isolates, making it difficult to observe patterns from other species. This imbalance can also bias the interpretation of the principal components, since one species disproportionately drives the variance. To avoid this dominance and obtain a clearer, more balanced view of resistance patterns, we downsample each species to the same number of isolates.

ggsave(here("results/images/08_PCA_all.png"), pca_all_plot)
Saving 7 x 5 in image
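How much weight PC1 and PC2 deserve depends on the share of variance they capture. As a sketch, using simulated scores as a stand-in for the real resistance matrix (the `abx_*` column names and the values are hypothetical), the proportions can be read from broom's `tidy()` with `matrix = "eigenvalues"`:

```r
library("broom")

# Simulated stand-in for the resistance matrix (random values, not real data)
set.seed(1)
toy_scores <- as.data.frame(
  matrix(sample(c(0, 0.5, 1), size = 60 * 4, replace = TRUE),
         ncol = 4,
         dimnames = list(NULL, c("abx_1", "abx_2", "abx_3", "abx_4")))
)

toy_pca <- toy_scores |>
  prcomp(scale. = TRUE)

# One row per principal component: std.dev, percent, cumulative
var_explained <- toy_pca |>
  tidy(matrix = "eigenvalues")

var_explained
```

Applied to the real fit, the same call would be `pca_fit |> tidy(matrix = "eigenvalues")`. The `percent` column could then be used to annotate the PC1 and PC2 axis labels.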

Create Balanced Dataset

We take the number of isolates in the least-represented species as the target size and randomly downsample every species to that size. This prevents highly abundant species from dominating the PCA structure.

# Size of the least-represented species
min_n <- MDR_df |>
  count(Species) |>
  summarise(min_n = min(n)) |>
  pull(min_n)

balanced_df <- MDR_df |>
  group_by(Species) |>
  slice_sample(n = min_n) |>
  ungroup()

balanced_df |> 
  count(Species)
# A tibble: 9 × 2
  Species                     n
  <chr>                   <int>
1 Acinetobacter baumannii   177
2 Citrobacter spp.          177
3 Enterobacteria spp.       177
4 Escherichia coli          177
5 Klebsiella pneumoniae     177
6 Morganella morganii       177
7 Proteus mirabilis         177
8 Pseudomonas aeruginosa    177
9 Serratia marcescens       177
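Because `slice_sample()` draws at random, the isolates retained (and therefore the downstream PCA) can differ between renders unless a seed is fixed beforehand. A minimal sketch of a reproducible draw, assuming a hypothetical toy table and an arbitrary seed value, with `draw_balanced()` being an illustrative helper rather than part of the pipeline:

```r
library("dplyr")

# Hypothetical toy data with unequal group sizes
toy_df <- tibble(
  Species = rep(c("A", "B"), times = c(10, 3)),
  isolate_id = 1:13
)

# Size of the smallest group
min_n <- toy_df |>
  count(Species) |>
  summarise(min_n = min(n)) |>
  pull(min_n)

# Illustrative helper: fix the seed, then downsample each group to n rows
draw_balanced <- function(df, n, seed) {
  set.seed(seed)  # the seed value is arbitrary; fixing it is what matters
  df |>
    group_by(Species) |>
    slice_sample(n = n) |>
    ungroup()
}

first_draw <- draw_balanced(toy_df, min_n, seed = 42)
second_draw <- draw_balanced(toy_df, min_n, seed = 42)

# Same seed, same rows
identical(first_draw, second_draw)
```

In the analysis itself, a single `set.seed()` call placed before the `slice_sample()` step would have the same effect.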

Run PCA on Balanced Dataset

We repeat the PCA on the balanced dataset.

# Same PCA as before, now on the balanced data
pca_fit_balanced <- balanced_df |>
  select(all_of(antibiotics_cols)) |>
  prcomp(scale. = TRUE)

# Attach the PC scores to the balanced rows and plot the first two components
pca_downsampling_plot <- pca_fit_balanced |>
  augment(balanced_df) |>
  ggplot(mapping = aes(x = .fittedPC1,
                       y = .fittedPC2,
                       color = Species)) +
  geom_point() +
  labs(x = "PC1",
       y = "PC2")

pca_downsampling_plot

The PCA of the balanced dataset shows that most species overlap substantially in their resistance profiles, indicating that many antibiotic-resistance patterns are shared across taxa. However, some species occupy more distinct regions of the PCA space. For instance, Pseudomonas aeruginosa and Serratia marcescens sit predominantly toward the right side of the plot (higher PC1 values), forming more defined clusters that show limited overlap with the left cloud dominated by Escherichia coli. This separation suggests that these two species follow characteristic combinations of resistance traits that differ from the broader and more heterogeneous profiles observed in E. coli and the remaining species. Overall, while many species share common multidrug-resistance patterns, a subset displays more distinct resistance signatures that set them apart from the main continuum.

ggsave(here("results/images/08_PCA_downsampling.png"), pca_downsampling_plot)
Saving 7 x 5 in image
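Which antibiotics drive the separation along PC1 can be checked from the PCA loadings. As a sketch only, again on simulated scores with hypothetical `abx_*` column names (so the resulting ranking carries no biological meaning), broom's `tidy()` with `matrix = "rotation"` returns the loadings in long format:

```r
library("dplyr")
library("broom")

# Simulated stand-in for the balanced resistance matrix (random values)
set.seed(1)
toy_scores <- as.data.frame(
  matrix(sample(c(0, 0.5, 1), size = 60 * 4, replace = TRUE),
         ncol = 4,
         dimnames = list(NULL, c("abx_1", "abx_2", "abx_3", "abx_4")))
)

toy_pca <- prcomp(toy_scores, scale. = TRUE)

# One row per (variable, PC) pair; keep PC1 and rank by absolute loading
pc1_loadings <- toy_pca |>
  tidy(matrix = "rotation") |>
  filter(PC == 1) |>
  arrange(desc(abs(value)))

pc1_loadings
```

On the real data, the equivalent call would start from `pca_fit_balanced`. Antibiotics with the largest absolute PC1 loadings would be the first candidates for explaining the separation described above.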